Domain: Semiconductor manufacturing process



CONTEXT: A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of signals/variables collected from sensors and process measurement points. However, not all of these signals are equally valuable in a specific monitoring system: the measured signals contain a combination of useful information, irrelevant information and noise. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. Process engineers may then use these signals to determine the key factors contributing to yield excursions downstream in the process. This enables increased process throughput, decreased time to learning and reduced per-unit production costs. These signals can be used as features to predict the yield type, and by analysing different combinations of features, the essential signals impacting the yield type can be identified.

DATA DESCRIPTION : sensor-data.csv : (1567, 592)
The data consists of 1567 examples, each with 591 features. Each example represents a single production entity with its associated measured features, and the label is a simple pass/fail yield for in-house line testing. In the target column, "-1" corresponds to a pass and "1" corresponds to a fail, and the timestamp is for that specific test point.


PROJECT OBJECTIVE: We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.



Steps and tasks: [ Total Score: 60 points]


1. Import and explore the data.

  • For the past data (pdata), Time is stored as an object; for the future-predictions data (fdata), the time is already in the correct format.
  • All the features are numeric in nature.
  • In pdata, all the features have float datatype.
  • In fdata, some features are int and some are float datatype.
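These checks can be sketched as follows. A tiny synthetic frame stands in for sensor-data.csv (the real file has 1567 rows and 591 features plus the target); the column names here are placeholders, not the real sensor names:

```python
import pandas as pd

# Synthetic stand-in for the past data (pdata); "f1"/"f2" are placeholder names.
pdata = pd.DataFrame({
    "Time": ["2008-07-19 11:55:00", "2008-07-19 12:32:00"],  # read in as object
    "f1": [3030.93, 3095.78],
    "f2": [2564.00, 2465.14],
    "Pass/Fail": [-1, 1],
})
assert pdata["Time"].dtype == object           # Time stored as object, as noted
pdata["Time"] = pd.to_datetime(pdata["Time"])  # convert to a proper datetime
print(pdata.dtypes)                            # inspect per-column datatypes
```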

  2. Data cleansing:
    • Missing value treatment.
    • Drop attribute/s if required using relevant functional knowledge.
    • Make all relevant modifications on the data using both functional/logical reasoning/assumptions.
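The missing-value treatment can be sketched on a small synthetic frame (column names are placeholders). Since every feature is numeric, the median is a reasonable fill value that is less sensitive to the outliers common in sensor readings:

```python
import numpy as np
import pandas as pd

# Synthetic frame with gaps; the real data has 591 numeric sensor columns.
df = pd.DataFrame({
    "s1": [1.0, np.nan, 3.0, 5.0],
    "s2": [np.nan, 2.0, 2.0, 4.0],
})
# Impute each column's missing entries with that column's median.
df = df.fillna(df.median())
print(df)
```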

    3. Data analysis & visualisation:
    • Perform detailed relevant statistical analysis on the data.
    • Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

    NOTE: Both Steps 2 and 3 are performed together in the following code, since the data is dominantly numeric in nature.

    So 538 columns contain missing values. Let's look at the columns with the highest amounts of missing values:

  • Some features are almost empty in the past data.
  • In the future data, some columns are completely empty.

  • Remark: There is a big jump from 17.4% to 45.6% missing. Generally, features with more than 35% missing data do not offer much value in prediction.
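The 35% rule above can be applied like this, shown on a synthetic frame with one sparse and one dense column (names are illustrative only):

```python
import numpy as np
import pandas as pd

# One column is two-thirds missing, the other fully observed.
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, 3.0, np.nan, np.nan, 6.0],
    "dense": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
missing_pct = df.isna().mean() * 100                      # % missing per column
df = df.drop(columns=missing_pct[missing_pct > 35].index)  # drop sparse columns
print(list(df.columns))
```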

    A large number of zeros are present; many features have only one value, i.e. 0 throughout.

    More than 250 features have extremely low variance (< 0.1) and thus contribute minimally to the output.
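The low-variance filter can be sketched directly with pandas on a synthetic frame (column names are placeholders); scikit-learn's VarianceThreshold would do the same job:

```python
import pandas as pd

# Three synthetic columns: one constant, one near-constant, one informative.
df = pd.DataFrame({
    "constant": [0.0, 0.0, 0.0, 0.0],
    "tiny_var": [0.00, 0.01, 0.00, 0.01],
    "useful": [1.0, 5.0, 9.0, 3.0],
})
variances = df.var()                                      # sample variance per column
df = df.drop(columns=variances[variances < 0.1].index)    # drop near-constant features
print(list(df.columns))
```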

    There are still many columns with null values; we will gather more information before dropping all of them.

    Out of 591 features, we now have 130 features, which will have a significant impact on model building.

    As we can see from the graph, Column 521 is dominated by 0 values.

    Evidently no other feature is heavily dominated by a single value.
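The dominance check behind these remarks can be sketched by measuring the share of each column's most frequent value, here on synthetic columns (names are placeholders):

```python
import pandas as pd

# One zero-heavy column (like Column 521 above) and one with no dominant value.
df = pd.DataFrame({
    "zero_heavy": [0, 0, 0, 0, 0, 0, 0, 0, 0, 7],
    "balanced": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})
# Share of the single most frequent value in each column.
top_share = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])
print(top_share)
```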

    The 5-point summary does not convey much relevant information, since we don't understand the nature of the variables or their expected values.

    As we can see, there is a major class imbalance in the given dataset, and we will have to take care of that while building the model; otherwise the model will be biased towards the majority class, i.e. Pass (-1).
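The imbalance can be quantified with a simple value count. The -1/1 labels follow the data description; the counts below are synthetic, not the real ones:

```python
import pandas as pd

# Synthetic target with a 90/10 split, to illustrate the imbalance check.
y = pd.Series([-1] * 90 + [1] * 10, name="Pass/Fail")
print(y.value_counts(normalize=True))  # share of each class
```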

    4. Data pre-processing:
    • Segregate predictors vs target attributes
    • Check for target balancing and fix it if found imbalanced.
    • Perform train-test split and standardise the data or vice versa if required.
    • Check if the train and test data have similar statistical characteristics when compared with original data.
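The split-then-standardise step can be sketched on synthetic data; the scaler is fit on the training split only, so test-set statistics never leak into it:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors and target in place of the real sensor data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Stratified split keeps the class proportions similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)   # fit on the training split only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.mean(axis=0).round(2))   # each standardised column is ~0-mean
```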

    We've already seen the target imbalance, and we will try various methods to find the best model.

    Trying various algorithms along with different sampling techniques

    No Sampling

    Random undersampling

    SMOTE

    Random Oversampling

    ADASYN Sampling

    Gaussian Naive Bayes on Normal Dataset

    Gaussian Naive Bayes on Under sampled Data

    LightGBM on SMOTE sampled Dataset

    RandomForest on Random over sampled Dataset

    LightGBM on ADASYN sampled Dataset

    Among the given models, Random Forest with random over-sampled data gives the best results.
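The winning combination can be sketched with scikit-learn alone, using sklearn.utils.resample for the random oversampling (imbalanced-learn's RandomOverSampler would do the same job); the data here is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced dataset (~90% majority class).
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class in the training split only (with replacement)
# until it matches the majority count -- never resample the test split.
min_mask = y_tr == 1
X_min, y_min = resample(X_tr[min_mask], y_tr[min_mask],
                        n_samples=int((~min_mask).sum()), random_state=0)
X_bal = np.vstack([X_tr[~min_mask], X_min])
y_bal = np.concatenate([y_tr[~min_mask], y_min])

clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print(round(f1_score(y_te, clf.predict(X_te)), 3))
```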

    PCA

    Overall performance:
    Best F1 score: XGBoost without PCA
    Best CV score: SVM without PCA
    Best precision: XGBoost with PCA
    Best recall: Logistic regression with PCA

    Thus, other than XGBoost with PCA, all the algorithms have almost similar mean CV scores.
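The with/without-PCA comparison above can be sketched on synthetic data, with logistic regression standing in for the full model list (XGBoost and LightGBM need their own packages). Keeping PCA inside the pipeline refits it on each CV fold, avoiding leakage:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned sensor features.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)

with_pca = make_pipeline(StandardScaler(), PCA(n_components=10),
                         LogisticRegression(max_iter=1000))
without_pca = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

for name, model in [("with PCA", with_pca), ("without PCA", without_pca)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(name, round(scores.mean(), 3))
```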

    To validate on the future dataset